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Abstract. An overview of the Conquest linear scaling density functional theory 
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performance computing (HPC) platforms. Wc demonstrate that essentially perfect 
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1. Introduction 

Linear scaling approaches to atomistic calculations have their origin in molecular 
dynamics codes with force fields: the idea that, by calculating interactions for each 
atom only within a local part of space, computational effort scales with a local volume 
leads to to efficient increases in system size; this also leads to natural parallelisation 
schemes. It is well known that, for systems with a gap or metals at finite temperature, 
electronic structure is local, and falls off exponentially with distance — summed up in 
Kohn's "nearsightedness" principle[T]. This realisation led to linear scaling tight binding 
methods in the 1980s. Methods for 0{N) or linear-scaling DFT calculations [2] were first 
proposed over 15 years ago[3l HI El El 13 El E] , but it is only in the last five years that 
practical calculations using these methods have begun to appear. The developments 
which have enabled these calculations will be surveyed in detail in Section [2j In brief, 
the algorithms used to find the ground state have converged on a few main methods, 
while there has been more work on the local orbitals used to represent the density 
matrix. Local basis sets, and their efficient implementation and minimisation, are key 
to performance in linear scaling codes. 

Part of the reason for the slow development of practical codes is that parallelisation 
is extremely important. If calculations on tens of thousands or hundreds of thousands 
of atoms are to be performed, this will require hundreds or thousands of processors (or 
cores in the case of multi-core processors, as is becoming almost universal). The efficient 
implementation of linear scaling codes on parallel machines has received attention 
before [Tot ITTl [T2l [ISl [H]; in this paper, we will explore how far scaling can be extended 
efficiently. We find that there is every reason to believe that linear scaling DFT will 
make extremely good use of the hundreds of thousands of cores which are becoming 
available with petascale computer^ In this article, we will consider the performance of 
our linear scaling DFT code Conquest P [101 [ISlIini [HI [TBI [191 ED], but there are other 
linear scaling DFT codes under development, for instance Siesta[2T], OpenMX[22] and 
ONETEP|23]- As will be described in the next section, most linear scaling methods work 
by using a reformulation of DFT in terms of the density matrix, and apply localisation 
constraints to achieve good scaling with system size. 

In the next section, we give an overview of the Conquest methodology, covering 
the fundamental theory as well as details of the implementation. The results section 
forms the central part of the paper, presenting scaling data both with respect to number 
of cores and numbers of atoms. We conclude with a brief look forward. 

I The Jaguar Cray machine instaUed at Oak Ridge National Laboratory in America is the first petaflop 
machine, and has 150,000 cores, while the next-generation supercomputer in Japan, which is scheduled 
for completion in 2011, will have a peak performance of over 10 petaflops and will require several 
hundred thousand cores. 
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2. Methodology 

The ideas underlying Conquest have been presented before P [TOl fTS | [TB I [T7 t [Tg | [1^ 1 [2U] . 
but we will give an overview here for convenience and to help explain the implementation 
details given below; the interested reader is referred to previous publications for full 
details. As is common with many linear-scaling codes, Conquest works directly with 
the density matrix rather than wavef unctions, and writes it in a separable form: 

P(r,r ) = (pia{r)Ki^jf3(pjf3{r'), (1) 

ia,jf3 

where 0iQ,(r) is a strictly local function centred on atom i called a support function; 
multiple support functions on the same atom are notated with a. The support functions 
are not orthogonal, and there is an associated overlap matrix: 

Siajp = J dr0ia(r)0j73(r). (2) 

The density matrix in the basis of support functions is written Kiajp. Locality is imposed 
in Conquest via a spherical cutoff on the support functions i?cut and a distance-based 
criterion on the elements of an auxiliary density matrix from which K is derived. 

For a given set of support functions, the ground state is found by varying the 
elements of K to minimise the energy subject to various conditions: 

(i) Self-consistency between the charge density and potential 

(ii) Correct electron number, = 2Tt[KS] 

(iii) Idempotency of the density matrix 

The first of these conditions is a standard problem within electronic structure, and 
while not trivial, has been widely explored in other contexts [201 IMl [25]. The second is 
relatively easy to impose, and can be incorporated within the minimisation [101 [15] • The 
final condition is extremely hard to impose, and we instead use the ideas of McWeeny [26] 
to impose weak idempotency. By writing K in terms of an auxiliary density matrix 
(ADM), L, we ensure that its eigenvalues lie between and 1 and converge towards 
these extrema as the minimisation proceeds [5l [U [TH [261 [S]: 

K = 3LSL - 2LSLSL. (3) 

This method for achieving idempotency is sometimes known as the ADM or LNV 
method. 

Practically, the localisation on the density matrix is imposed on L, so that 
Lij = 0, |Rj — Rjl > Rl- By using sparse matrices, and carefully constructed sparse 
matrix methods [I2j. the computational time and memory required for minimisation of 
energy with respect to the elements of K (and ultimately L) scale linearly with the 
number of atoms in the system. 

This is one area where Conquest differs from other linear scaling DFT codes 
(though it is not the only linear scaling DFT code to use the ADM method: the 
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ONETEP code[23], for instance, also uses it). The Orbital Minimisation Method 
(OMM) 13, [6l [27] is another variational approach, though it is not commonly used (it 
is implemented in the SIESTA code [28] and used by Tsuchida[29]). Non- variational 
methods commonly used include the divide- and-conquer [3] (D&C) method and the 
trace-correcting family of methods j3Uj). These are the main methods used for the density 
matrix search. 

The representation of the support functions or local orbitals is an important 
problem within linear scaling electronic structure techniques, and is another area where 
Conquest differs from other codes. The codes now available [HI |22l [23l |28] are of two 
types: those that use basis sets akin to plane waves (including blips or B-splines[M]. 
finite element approaches [32| [33| IM] . periodic sine functions[35] and wavelets(36]). 
which allow systematic basis-set convergence; and those that use pseudo-atomic orbitals 
(PAOs) as basis sets[22l EHl ETJ [3H1 EHl SO], for which systematic convergence is usually 
significantly harder, but which have smaller basis sizes. An important feature of our own 
Conquest code[THl[19],|20] is that both types of basis are implemented, and this means 
that rapid, though semi-quantitative calculations can be performed for exploratory 
purposes, but precise calculations are also possible. The support functions are written: 



where Xs{^) is a basis function centred on atom i. 

The quantitative basis set uses blip functions, specifically b-splines, on a cubic 
regular grid defined within the support region for each atoms fST], which can be 
related to a plane-wave energy cutoff; it is, however, perfectly possible to use different 
spacings for different atoms. By increasing the support region radius and the L 
matrix cutoff systematically, it is possible to achieve plane-wave accuracy linear scaling 

calculations[iniEIlllI]. 

The different computational operations in Conquest can be summarised as: 

(i) Matrix multiplication (e.g. Kij = LikSkiLij) 

(ii) Integration on a grid (which is regular, and defined along the simulation cell lattice 
vectors) 

(iii) Basis function operations: analytic integrals or basis-to-grid transformations 

(iv) Fast fourier transforms (performed on the same grid as integration) 

(v) Communication of information between cores 

The parallelisation strategy in Conouest[TI]1 W2\ [T7] relies on the division of the 
computational cell into small groups of atoms (partitions) and integration grid points 
(blocks); typically a partition will contain ~5-20 atoms, and a block will have size of 
3 X 3 X 3 — 8 X 8 X 8 grid points. These are assembled into groups (bundles of partitions and 
domains of blocks) which should be both localised and overlapping for good performance, 
and assigned to cores (for multi-core CPUs). The assembly of domains and bundles, and 
the assignment of these groups to cores can strongly affect the efficiency of the code. We 
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have implemented a default partitioning scheme based on Hilbert curves [13] which allows 
calculations without detailed optimisation of load balancing; examples of the effeciency 
of parallel scaling with this scheme are given below in Sec. |3| Details of partitioning can 
be optimised externally to Conquest, and this allows different approaches to be taken. 
There are some computational cells where the assignment of domains and bundles to 
cores is obvious (for instance the cubic cells used for scaling tests up to millions of 
atoms), and a simple script will allow the optimal distribution to be created. We also 
have an optimising code which uses simulated annealing to load-balance the system. 

Conquest can operate at different levels of accuracy, depending on the basis set 
chosen, and other factors. If PAOs are used, with only a minimal basis set and no 
self consistency, then we have non-self-consistent ah initio tight binding (NSC-AITB). 
If self-consistency is introduced and the basis set expanded somewhat then the code 
runs at the level of self-consistent AITB. For full PAO basis sets and bhp functions 
with full basis optimisation we achieve full DFT accuracy, and when cutoffs are taken 
to convergence we can recover plane-wave accuracy. Conquest has the PBE GGA 
functional implemented as well as LDA, at all levels[l2]. Forces are calculated exactly 
as derivatives of the energy [T^ W2\ and are implemented at all levels of accuracy, both 
for LDA and GGA 02]. 

Many of the calculations in this paper operate at the NSC-AITB level, as this uses 
the full functionality of the code and permits good scaling tests. This does not mean, 
however, that this is how we anticipate using the code; indeed, we have performed self- 
consistent calculations on cells up to 262,144 atoms (described below) with no decrease 
in the scaling. Optimisation of support functions scales in a similar manner. 

3. Results 

The results in this section are intended to demonstrate the scaling performance of 
the Conquest code. We have recently used the code for a series of calculations on 
the Ge(105) surface[l3] and on the energetics of self-assembly of Ge hut clusters on 
Si(001)|13j, and we draw many example systems from these studies. Details of the 
systems are given in the papers already published. Linear scaling methods are also 
ideal for application to ionic materials (which often have large band gaps) and we have 
successfully performed exploratory self-consistent calculations on MgO surfaces with 
defects. We are also using Conquest to perform calculations on other systems, such 
as biomolecules, and we are actively pursuing this area|12l SSI SS] 

We start by considering the strong scaling performance on the current UK HPC 
facility HECToR (a Cray XT4): that is increasing the number of cores used in a 
calculation while keeping the system size fixed. For these tests, we have used two of 
the Ge hut cluster systems, containing 11,620 and 22,746 atoms respectively. The unit 
cells are far from cubic, which presents a non-ideal situation for the default partitioner 
which uses a 3D Hilbert curve and performs best for cells close to cubic|§] The smaller 

§ We are developing improvements to the partitioner to allow us to move away from this restriction, 
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Number of cores Number of cores 

Figure 1. Scaling using automated Hilbert partitioning for two different hut clusters, 
(a) Hut cluster with 11,620 atoms on 16 — 512 cores (increasing by 32 times); (b) Hut 
cluster with 22,746 atoms on 64 — 288 cores (increasing by 4.5 times). 

system allows somewhat better assignment of atoms to cores, and we have tested the 
parallel scaling more extensively on this system. 

Results are shown in Fig. [l] In Fig. [ij^a) we show the speed up of the code for the 
smaller hut cluster, as we increase the number of cores by a factor of up to 32. For 
an increase of a factor of up to 8, the scaling is excellent with an efficiency (defined as 
speed-up measured divided by increase in number of cores) of about 80%. For further 
increases in numbers of cores we see smaller efficencies, but the efficiency is still over 
60%. In Fig. [T|b) we show the scaling for the larger hut cluster, as the number of cores 
is increased by a factor of up to 4.5. This scaling is excellent, and remains at over 90%. 

We can understand the strong scaling behaviour from the parallelisation strategy. 
The main computational load in Conquest is the sparse matrix multiplication, which 
we have optimised extensively [12]. The time required for multiplies scales with the 
number of neighbours of each atom as well as the number of atoms per core; in the hut 
cluster system shown above, some atoms are near the surface of the system with fewer 
neighbours, while others are in the bulk with more neighbours. With less than 20 atoms 
per core, it is rather hard to achieve good load balancing. Good load balancing also 
requires that the bundles of atoms assigned to cores are compact, and this is difficult to 
achieve with small numbers of atoms per core. Also, as the number of atoms per core 
decreases, communications overhead will start to dominate. This behaviour is clearly 
seen in Fig. [l| where there are only ~20 atoms/core at the largest number of cores. The 
most efficient results are seen for 40 or more atoms per core. 

In practical calculations, we usually increase the number of cores proportionately 
with the number of atoms in the system. Therefore, more realistic tests of the scaling 

but these are still at a preliminary stage. 
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Figure 2. Scaling on LCN cluster (dual core Opteron, Myrinet interconnect). Main 
graph shows time per core for different atoms per core (reaching 4,096 atoms on 32 
cores); inset shows the change of time/core with increasing numbers of atoms/core, 
with a linear increase for the average shown with a dashed line. 



are to fix the number of atoms per core, and increase tlie number of cores at the same 
time as increasing the number of atoms; this is known as weak scahng. Results for 
this mode of operation on a local HPC cluster (based on dual-processor, dual-core Sun 
servers with Myrinet interconnects) are shown in Fig. |2} The system being tested is 
bulk silicon, which while not scientifically interesting, is simple to prepare and contains 
all the essential physics we wish to test. The main graph shows the time per core 
plotted against number of atoms in the system for different numbers of atoms per core: 
32 atoms/core, 64 atoms/core and 128 atoms/core. A number of important points 
come out of this plot: first, the time per core is effectively constant for the systems 
considered; second, communication becomes unimportant for 64 atoms/core or more 
(as seen in Fig. [l]as well); third, as shown in the inset, the linear scaling performance 
of the code is excellent, lying on the ideal linear curve. 

Finally, we are concerned to show that this good scaling behaviour persists to 
extremely large systems, so we have taken a system which can be easily partitioned and 
scaled and scaled to over 4,000 cores and over 2,000,000 atoms. We show increase in 
total time (i.e. time/core summed over cores) vs increase in system size, as well as total 
time and total energy plotted against number of atoms in Fig. |3] These were run on 
HECToR, using between 8 and 4,096 cores with 512 atoms/core, giving 2,097,152 atoms 
as the largest cell considered. Details of times, energies and numbers of cores are given 
in Table [l[ 



The most important result from this calculation is that DFT calculations on millions 
of atoms are now possible. We see that the time per core does not increase with system 
size, and that the energy per atom is constant. There are parts of the Conquest code 
which are not strictly linear scaling: we use an Ewald sum for electrostatic interactions 

II The grid spacing used was a little coarser than we would normally choose, to reduce memory 
requirements; however, we note that this will not affect the convergence or scaling, and have tested the 
smaller systems with finer grids to ensure that there is no effect from this. 




Figure 3. Linear and parallel scaling for bulk silicon on 512 — 4096 cores. The insets 
show total time and total energy (made positive to enable log plot) while main graph 
shows increase in time with system size. Exact data values are given in Table [l] 

Atoms Time/core (s) Total energy (Ha) Cores 

4,096 7068.878 -308.268785 8 

32,768 6893.759 -2,466.150282 64 

262,144 6931.418 -19,729.202254 512 

2,097,152 7032.496 -157,833.618033 4096 

Table 1. Times and energies for Conquest runs with 512 atoms/core. The energy per 
atom takes a constant value of 0.075261 Ha. 



(which can be easily replaced with a scheme such as the neutral atom potential PSIIST] ) 
which scales as 0{N^^'^) and fast fourier transforms which scale as A^log(A^), but even for 
the 2,097,152 atom unit cell these are negligible (approximately 3s for all FFT-related 
work and 50s for Ewald sum). We note that orbital-free DFT calculations have recently 
been performed on a supercell of 1,012,500 atoms of bulk Al|47]. 

We have also performed self-consistent 0(N) calculations on this system (actually 
for the first three cells), and find that they require four to five times as long, with 
our current implementation (in this case, the variational nature of the minimisation 
means that, as self-consistency is approached, less time is spent finding the density 
matrix); the scaling of the code when performing self-consistent calculations, is identical 
to non-self-consistent calculations. The main challenge now is to improve the efficiency 
of the code, and reduce the number of atoms per core which can be run without 
communciations becoming a heavy burden. We will focus on three main areas: first, 
efficient re-use of variational data such as the L matrix to reduce the time to the ground 
state; second, robustness and stability of the calculations; and finally, more efficient 
automatic partitioning [13]. This will allow us to consider real scientific problems which 
require tens or hundreds of thousands of atoms, and to perform molecular dynamics 
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simulations using petascale computer platforms. 
4. Conclusions 

Linear scaling approaches to DFT have been under development for about fifteen years, 
and are now starting to show their promise in real calculations, and in their applicability 
to petascale computers. This special issue of Journal of Physics: Condensed Matter is 
in honour of Professor Mike Gillan's 65th birthday, and it is appropriate to celebrate 
the considerable contribution which he has made to the development of linear scaling 
DFT, both through the theory and implementation [9l HDlinillSllISlIIZlIIHllIHlEOlEIl 
SSI mi HSl SHI Hni En] • The results in this paper show that linear scaling DFT is realising 
its potential, and that Mike Gillan's contributions have underpinned all that has gone 
on in the field. 

It has been noted [5U] that applications of linear scaling methods to real problems 
are starting to emerge; the challenge now is to make linear scaling methods sufficiently 
robust and efficient that they can be used as routinely as standard DFT methods, and 
to find applications which demonstate their power. Examples of applications of these 
methods include work on DNA with Siesta[51] and Conquest [15], biomolecules with 
ONETEP|52] and Conquest[12], HH] and our work on Ge on Si(OOl) with Conquest, 
extending to over 20,000 atoms|l3l HI]. Among other applications, we intend to extend 
the Ge work to the transition from hut clusters to domes, as well as applying Conquest 
to biomolecules [IH]. The code will also be released under a GPL licence in the near 
future. 
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